Abstract: The rapid growth of the World Wide Web has
imposed great challenges for researchers in improving
search efficiency over the internet. Nowadays web
document clustering has become an important research
topic for providing the most relevant documents among
the huge volumes of results returned in response to a
simple query. In this paper, we first propose a novel
approach that precisely defines clusters based on
maximal frequent item sets (MFI) found by the Apriori
algorithm, and afterwards utilize the same MFI based
similarity measure for hierarchical document
clustering. By considering maximal frequent item sets,
the dimensionality of the document set is decreased.
Secondly, we provide privacy preservation for open web
documents by avoiding duplicate documents, thereby
protecting the individual copyrights of documents. This
is achieved using an equivalence relation.

Keywords: Maximal Frequent Item set, Apriori 
algorithm, Hierarchical document clustering, 
equivalence relation. 

I. INTRODUCTION 

Document clustering has been studied intensively
because of its wide applicability in areas such as web
mining, search engines, text mining and information
retrieval. The rapid spread of databases into every
area of human activity has resulted in an enormous
demand for efficient algorithms that turn data into
valuable knowledge.

Document clustering has been approached with various
methods, yet it is still inefficient at providing
exactly, or even approximately, the information the
user needs. Suppose the user makes an incorrect
selection while browsing the documents in a hierarchy.
If the user does not notice the mistake until browsing
deep into the hierarchy, the efficiency of the search
decreases and the number of navigation steps needed to
find relevant documents increases. So we need a
hierarchical clustering that is relatively flat, which
reduces the number of navigation steps. Therefore there
is a great need for new document clustering algorithms
that are more efficient than conventional clustering
algorithms [1,2].

The rapid growth of the World Wide Web has imposed
great challenges for researchers to cluster similar
documents over the internet and thereby improve the
efficiency of search. Search engine users are getting
more confused in selecting the relevant documents among
the huge volumes of search results returned for a
simple query. A potential solution to this problem is
to cluster similar web documents, which helps the user
identify the relevant data easily and effectively [3].

This paper is divided into six sections. Section II
briefly discusses related work. Section III describes
our proposed algorithm, including the common
preprocessing steps and the pseudo code of the
algorithm; it also covers how clusters are precisely
defined based on maximal frequent item sets (MFI) found
by the Apriori algorithm. Section IV describes how the
same MFI based similarity measure is exploited for
hierarchical document clustering, with a running
example. Section V provides privacy preservation for
open web documents by using an equivalence relation to
protect the individual copyrights of a document.
Section VI consists of the conclusion and future scope.

II. RELATED WORK

Related work on using maximal frequent item sets in web
document clustering is as follows. Ling Zhuang and
Honghua Dai [4] introduced a new criterion to
specifically locate the initial points using maximal
frequent item sets. These initial points are then used
as centers for the k-means algorithm. However, k-means
clustering is a completely unstructured approach, is
sensitive to noise and produces an unorganized
collection of clusters that is not favorable to
interpretation [5, 6]. To minimize the overlapping of
documents, Beil, Ester [7] proposed HFTC (Hierarchical
Frequent Text Clustering), another frequent item set
based approach for choosing the next frequent item
sets. But the clustering result depends on the order in
which the next frequent item sets are chosen. The
resulting hierarchy in HFTC usually contains many
clusters at the first level. As a result, documents in
the same class are distributed into different branches
of the hierarchy, which decreases the overall
clustering accuracy.

C. M. Fung [8] introduced the FIHC (Frequent Item set
based Hierarchical Clustering) method for document
clustering, in which a cluster topic tree is
constructed based on the similarity among clusters.
FIHC uses efficient child pruning when the number of
clusters is large and applies the more elaborate
sibling merging only when the number of clusters is
small. The experimental results show that FIHC
outperforms the other algorithms (bisecting k-means,
UPGMA) in accuracy for most numbers of clusters.

The Apriori algorithm [9] is a well-known method for
computing frequent item sets in a transaction database.
Documents under the same topic share more common
frequent item sets (terms) than documents of different
topics. The main advantage of using frequent item sets
is that they can identify the relation among more than
two documents at a time in a document collection,
unlike a similarity measure between two documents
[10, 11]. By means of maximal frequent item sets, the
dimensionality of the document set is reduced.
Moreover, maximal frequent item sets capture the most
related document sets. On the other hand, hierarchical
clustering is most relevant for browsing and maps the
most specific documents to generalized documents in the
whole collection.

A conventional hierarchical clustering method
constructs the hierarchy by subdividing a parent
cluster or merging similar child clusters. It usually
suffers from its inability to perform tuning once a
merge or split decision has been made. This rigidity
may lower the clustering accuracy. Furthermore, because
a parent cluster in the hierarchy always contains all
objects of its children, this kind of hierarchy is not
suitable for browsing. The user may have difficulty
locating the intended object in such a large cluster.

Our hierarchical clustering method is completely
different. We first form all the clusters by assigning
documents to the most similar cluster using maximal
frequent item sets found by the Apriori algorithm, and
then construct the hierarchical document clustering
based on the inter-cluster similarities via the same
maximal frequent item set (MFI) based similarity
measure. The clusters in the resulting hierarchy are
non-overlapping. The parent cluster contains only the
general documents.

III. ALGORITHM DESCRIPTION 

In this section, we describe our proposed algorithm,
including the common preprocessing steps and the pseudo
code of the algorithm. It also covers how clusters are
precisely defined based on maximal frequent item sets
(MFI) found by the Apriori algorithm. First, we discuss
some common preprocessing steps for representing each
document by item sets (terms). Second, we introduce the
vector space model by assigning weights to the terms in
all document sets. Finally, we explain the process of
initializing cluster seeds using MFI to perform
hierarchical clustering. Let $D_s$ represent the set of
all documents in the collection:

$D_s = \{d_1, d_2, d_3, \ldots, d_M\}$, $1 \le i \le M$

A. Pre-Processing 

The document set Ds is converted from an unstructured
format into a common representation using text
preprocessing techniques, in which words or terms are
extracted (tokenization). The input documents in Ds are
preprocessed by first removing HTML tags, then applying
a stop word list and a stemming algorithm.

a) HTML tags: parse and remove the HTML tags.

b) Stop words: remove stop words such as conjunctions,
connectives, prepositions, etc.

c) Stemming algorithm: we utilize the Porter2 stemmer
algorithm in our approach.
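
The three preprocessing steps can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the stop word list is a tiny hypothetical sample, and `toy_stem` is a placeholder suffix stripper standing in for the Porter2 stemmer, which would normally come from a stemming library.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "with", "is", "are"}


class _TextExtractor(HTMLParser):
    """Collects the text content of an HTML page and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(page):
    """Step a): drop the HTML tags and keep only the text between them."""
    parser = _TextExtractor()
    parser.feed(page)
    parser.close()
    return " ".join(parser.parts)


def toy_stem(word):
    """Placeholder suffix stripper; NOT the Porter2 stemmer named in the paper."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(page):
    """Steps a) to c): strip tags, tokenize, remove stop words, stem."""
    text = strip_html(page).lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]
```

Each document then becomes a list of stemmed terms, which is the input assumed by the weighting step in the next subsection.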

B. Vector representation of document

The vector space model is the most commonly used
document representation model in text mining, web
mining and information retrieval. In this model each
document is represented as an n-dimensional term
vector. The value of each term in the n-dimensional
vector reflects the importance of the term in the
corresponding document. Let N be the total number of
terms and M the number of documents; each document can
be denoted as

$D_i = (term_{i1}, term_{i2}, \ldots, term_{in})$, $1 \le i \le M$,
where $df(term_{ij}) <$ threshold value.

Only terms whose document frequency is less than the
threshold value are considered, because the more often
a term appears throughout all documents in the whole
collection, the more poorly it discriminates between
documents [12]. The term frequency tf is the number of
times a term appears in a document. The document
frequency df of a term is the number of documents that
contain the term. We also construct the weighted
document vectors

$D_i = (w_{i1}, w_{i2}, w_{i3}, \ldots, w_{in})$,

where $w_{ij} = tf_{ij} \cdot IDf(j)$ and
$IDf(j) = \log(M / df_j)$, $1 \le j \le n$, where IDf is the
inverse document frequency.
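
The weighting scheme above can be illustrated with a short Python sketch. It assumes each document is already a list of preprocessed terms, takes the logarithm as the natural logarithm (the text does not fix the base), and uses the hypothetical names `tfidf_vectors` and `df_threshold`.

```python
import math
from collections import Counter


def tfidf_vectors(docs, df_threshold):
    """docs: list of token lists. Returns (vocabulary, list of weight vectors)."""
    m = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Keep only terms whose document frequency is below the threshold.
    vocab = sorted(t for t, n in df.items() if n < df_threshold)
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        # w_ij = tf_ij * log(M / df_j); the base of the log is an assumption.
        vectors.append([tf[t] * math.log(m / df[t]) for t in vocab])
    return vocab, vectors
```

A term that occurs in every document gets weight zero, which matches the remark that widely spread terms discriminate poorly between documents.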



Table 1: Transactional database representation of documents

Terms      Doc 1   Doc 2   Doc 3   Doc 4
Java         1       1       0       1
Beans        0       1       0       0
Servlets     1       0       1       1

By representing documents in vector form, we can easily
identify which documents contain the same features. The
more features documents have in common, the more
related they are. Thus, it is realistic to find well
related documents. Assume that each document is an item
in the transactional database and each term corresponds
to a transaction. Our aim is to search for highly
related documents "appearing" together with the same
features (the documents whose MFI features are close).
Similarly, maximal frequent item set discovery in the
transaction database serves the purpose of finding sets
of documents appearing together in many transactions,
i.e., document sets which have a large number of
features in common.
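
To make the transactional view concrete, the following Python sketch turns the rows of Table 1 into transactions (one per term, containing the documents in which the term occurs) and counts how often two documents appear together. The helper names `to_transactions` and `co_occurrence` are illustrative, not from the paper.

```python
# Rows of Table 1: term -> 0/1 flags for Doc 1..Doc 4.
table = {
    "Java":     [1, 1, 0, 1],
    "Beans":    [0, 1, 0, 0],
    "Servlets": [1, 0, 1, 1],
}


def to_transactions(table):
    """Each term becomes a transaction: the set of documents containing it."""
    return {
        term: frozenset(f"Doc {i + 1}" for i, flag in enumerate(flags) if flag)
        for term, flags in table.items()
    }


def co_occurrence(transactions, doc_a, doc_b):
    """Number of transactions (terms) in which both documents appear."""
    return sum(1 for docs in transactions.values() if doc_a in docs and doc_b in docs)


transactions = to_transactions(table)
```

In this small table Doc 1 and Doc 4 appear together in two transactions (Java and Servlets), so they share the most features.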

C. Apriori for maximal frequent item sets

Mining frequent item sets is a core topic of data
mining that focuses on finding the relations among
different items in a large database. Mining frequent
patterns is a crucial problem in many data mining
applications, such as the discovery of association
rules, correlations, multidimensional patterns and
numerous other important patterns inferred from
consumer market basket analysis, web access, etc. The
association mining problem is formulated as follows:
given a large database of item transactions, find all
frequent item sets, where a frequent item set is one
that occurs in at least a user-specified threshold of
the transactions in the database. Many of the proposed
item set mining algorithms are variants of Apriori,
which employs a bottom-up, breadth-first search that
enumerates every single frequent item set. Apriori is a
conventional algorithm that was first introduced for
mining association rules. Association rule mining can
be viewed as a two-step process:

(1) Identifying all frequent item sets

(2) Generating strong association rules from the
frequent item sets

At first, candidate item sets are generated, and
afterwards frequent item sets are mined with the help
of these candidate item sets. In the proposed approach,
we use only the frequent item sets for further
processing, so we carry out only the first step
(generation of maximal frequent item sets) of the
Apriori algorithm.



A frequent item set is a set of words that frequently
occur together; such sets are good candidates for
clusters and are denoted by FI. An item set $X$ is
closed if there does not exist an item set $X_1$ such
that $X \subset X_1$ and $t(X) = t(X_1)$, where $t(X)$ is
defined as the set of transactions that contain item
set $X$; frequent closed item sets are denoted by FCI.
If $X$ is frequent and no superset of $X$ is frequent
among the set of items $I$ in the transactional
database, then we say that $X$ is a maximal frequent
item set, denoted by MFI. Thus
$MFI \subseteq FCI \subseteq FI$. Whenever very long
patterns are present in the data, it is often
impractical to generate the entire set of frequent item
sets or closed item sets [16]. In that case, maximal
frequent item sets are adequate for such applications.
We employed the maximal frequent item set algorithm
from [17] using Apriori. These maximal frequent item
sets are the initial seeds for hierarchical document
clustering.
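
As an illustration of the definitions above, the sketch below implements a plain level-wise Apriori search and then filters the result down to the maximal frequent item sets. It is a generic textbook-style version written for clarity, not the specific algorithm of [17]; `min_support` is taken to be an absolute transaction count, which is an assumption.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset: support count} for all frequent item sets."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({item for t in transactions for item in t})
    level = {frozenset([i]): support(frozenset([i])) for i in items}
    level = {s: c for s, c in level.items() if c >= min_support}
    frequent = dict(level)
    k = 2
    while level:
        # Join step: unite frequent (k-1)-sets into candidate k-sets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = {c: support(c) for c in candidates}
        level = {c: n for c, n in level.items() if n >= min_support}
        frequent.update(level)
        k += 1
    return frequent


def maximal(frequent):
    """Keep the frequent item sets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]
```

Applied to the three transactions of Table 1 with a minimum support of 2, the frequent item sets are {Doc 1}, {Doc 2}, {Doc 4} and {Doc 1, Doc 4}, of which only {Doc 2} and {Doc 1, Doc 4} are maximal.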

D. Pseudo code of the algorithm

Algorithm: MFI Based Similarity Measure for Hierarchical
Document Clustering

Input: Document set $D_s$.

Definitions: MFI: maximal frequent item set;
tf: term frequency; df: document frequency.

Step 1. For each document in $D_s$, remove the HTML
tags, apply the stop word list and perform stemming.

Step 2. Calculate the term frequency (tf) and document
frequency (df):

$D_i = (term_{i1}, term_{i2}, \ldots, term_{in})$, $1 \le i \le M$,
where $df(term_{ij}) <$ threshold value.

Step 3. Construct the weighted document vectors for all
the documents:

$D_i = (w_{i1}, w_{i2}, w_{i3}, \ldots, w_{in})$, where
$w_{ij} = tf_{ij} \cdot IDf(j)$ and
$IDf(j) = \log(M / df_j)$, $1 \le j \le n$.

Step 4. Represent each document by the keywords whose
tf > support. Calculate the maximal frequent item sets
(MFI) of terms using the Apriori algorithm:

$MFI = \{F_1, F_2, F_3, \ldots, F_n\}$,
where each $F_i = \{d_1, d_2, d_3, \ldots, d_k\}$.

Step 5. If a document $d_i$ is in more than one maximal
frequent item set, then choose $I_d$ as the set
consisting of the maximal frequent item sets containing
$d_i$, and assign $I_x = I_{d0}$. For each maximal
frequent item set $I_{di}$ containing the document $d_i$:

If [jaccards(center($I_x$), $d_i$)
> jaccards(center($I_{di}$), $d_i$)]
then assign $I_x = I_{di}$.

Assign the document $d_i$ to $I_x$ and discard $d_i$ from
the other maximal frequent item sets. Repeat this
process for all documents that occur in more than one
maximal frequent item set.

Step 6. Apply hierarchical document clustering to make
these maximal frequent item sets $F_i$ clusters:
combine the documents in $F_i$ into a single new
document and represent it by the center of the maximal
frequent item set. The centers are obtained by
combining the features of the maximal frequent item set
of terms that groups the documents.

Step 7. Repeat the same process of hierarchical
document clustering based on maximal frequent item sets
for all levels in the hierarchy; stop if the total
number of documents equals one, else go to step 4.
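
Steps 5 and 6 can be sketched in Python as follows. Several choices here are assumptions made only for illustration: documents are plain term sets rather than weighted vectors, the center of an item set is taken to be the union of the term sets of its documents, and a shared document is kept in the item set whose center has the highest Jaccard similarity to it.

```python
def jaccard(a, b):
    """Jaccard similarity of two term sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def center(cluster, doc_terms):
    """Center of a cluster: the combined term features of its documents."""
    terms = set()
    for doc in cluster:
        terms |= doc_terms[doc]
    return terms


def resolve_overlaps(mfi, doc_terms):
    """Give every document to exactly one of the item sets that contain it."""
    centers = [center(f, doc_terms) for f in mfi]  # computed before any removal
    clusters = [set(f) for f in mfi]
    for doc in sorted({d for f in mfi for d in f}):
        owners = [i for i, f in enumerate(mfi) if doc in f]
        if len(owners) > 1:
            # Assumption: the most similar center wins; ties go to the first.
            best = max(owners, key=lambda i: jaccard(centers[i], doc_terms[doc]))
            for i in owners:
                if i != best:
                    clusters[i].discard(doc)
    return clusters
```

After `resolve_overlaps` returns, the clusters are disjoint, and `center` can be applied again to each of them to obtain the merged document that represents the cluster at the next level.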

IV. HIERARCHICAL CLUSTERS BASED ON 
MAXIMAL FREQUENT ITEM SETS 

After finding the maximal frequent item sets (MFI)
using the Apriori algorithm, we turn to describing the
creation of the hierarchical document clustering using
the same MFI based similarity measure. A simple example
is also provided to demonstrate the entire process. Let
the set of maximal frequent item sets found by the
Apriori algorithm over the whole collection of
documents $D_s$ be $MFI = \{F_1, F_2, F_3, \ldots, F_n\}$,
where each MFI consists of a set of documents
represented by $F_i = \{d_1, d_2, d_3, \ldots, d_k\}$. Then
consider the documents which occur in the maximal
frequent item sets in MFI as follows.

MFI documents: $\{d_1, d_2, d_3, d_4, d_5, d_6, d_7, d_8, d_9, d_{10}, d_{11}, d_{12}, d_{13}, d_{14}, d_{15}\}$

$F_1 = \{d_2, d_4, d_6\}$
$F_2 = \{d_3, d_4, d_8\}$
$F_3 = \{d_1, d_5, d_7\}$
$F_4 = \{d_4, d_2, d_{14}\}$
$F_5 = \{d_{10}, d_{12}, d_{15}\}$
$F_6 = \{d_9, d_{11}, d_{13}\}$

The clusters in the resulting hierarchy are
non-overlapping. This is achieved through the following
cases.

Case 1: If $F_i$ and $F_j$ are the same, then choose one
at random to form a cluster.

Case 2: If $F_i$ and $F_j$ are different, then form
clusters of the documents contained in $F_i$ and $F_j$
independently. In our example, the maximal frequent
item sets of documents $F_3$, $F_5$ and $F_6$ are
different. So we form clusters according to the
documents contained in each $F_i$, e.g.
$F_3 = \{d_1, d_5, d_7\}$ becomes one cluster in the
hierarchy and is represented by its center (as in
step 6).

Case 3: $F_i$ and $F_j$ contain some of the same
documents among the document lists obtained from MFI.
Consider document $d_2$, which is repeated in more than
one maximal frequent item set, $\{F_1, F_4\}$. Similarly
$d_4$ is repeated in $\{F_1, F_2, F_4\}$. Then choose
$I_d = \{F_1, F_2, F_4\} = \{I_{d0}, I_{d1}, I_{d2}\}$ for
document $d_4$. Assign $I_x = I_{d0} = F_1$. For each of
the maximal frequent item sets in $I_d$ containing the
document $d_4$, from $I_{d0}$ to $I_{d2}$, calculate the
measure

If [jaccards(center($I_x$), $d_4$)
> jaccards(center($I_{di}$), $d_4$)]

By using this Jaccard measure, we can identify which
maximal frequent item set, among those containing the
document $d_4$, is closest to $d_4$. Then assign
$I_x = I_{di}$.

Let us suppose that $d_4$ is closest to the maximal
frequent item set $F_4$. Assign the document $d_4$ to
$I_x = I_{di} = F_4$ and discard $d_4$ from the other
maximal frequent item sets. Similarly, $d_2$ belongs to
$F_1$. Repeat this process for all documents that occur
in more than one maximal frequent item set; after this
step, each document belongs to exactly one cluster.
Since the documents $d_2$ and $d_4$ are repeated in $F_1$
and $F_4$, the clusters formed at the first level of the
hierarchy by applying step 5 and step 6 are as follows.

$F_1 = \{d_2, d_6\}$
$F_2 = \{d_3, d_8\}$
$F_3 = \{d_1, d_5, d_7\}$
$F_4 = \{d_4, d_{14}\}$
$F_5 = \{d_{10}, d_{12}, d_{15}\}$
$F_6 = \{d_9, d_{11}, d_{13}\}$
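
The running example can be replayed with a few lines of Python. The sketch does not compute any similarity; it simply takes the outcome supposed in the text (d4 is closest to F4, d2 to F1) as given in the hypothetical `closest` mapping and removes each shared document from the other item sets.

```python
# Maximal frequent item sets of documents from the running example.
mfi = {
    "F1": ["d2", "d4", "d6"],
    "F2": ["d3", "d4", "d8"],
    "F3": ["d1", "d5", "d7"],
    "F4": ["d4", "d2", "d14"],
    "F5": ["d10", "d12", "d15"],
    "F6": ["d9", "d11", "d13"],
}

# Outcome of the Jaccard comparison as supposed in the text (not computed here).
closest = {"d2": "F1", "d4": "F4"}


def first_level_clusters(mfi, closest):
    """Keep a shared document only in the item set it is closest to."""
    clusters = {}
    for name, docs in mfi.items():
        kept = []
        for d in docs:
            owners = [n for n, ds in mfi.items() if d in ds]
            if len(owners) == 1 or closest[d] == name:
                kept.append(d)
        clusters[name] = kept
    return clusters


clusters = first_level_clusters(mfi, closest)
```

The result reproduces the six first-level clusters listed above, and every one of the fifteen documents ends up in exactly one cluster.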

The hierarchy for the above maximal frequent item set
clusters can be represented as follows. The same
process of hierarchical document clustering based on
maximal frequent item sets is repeated for all levels
in the hierarchy; stop if the total number of documents
equals one, else go to step 4.

[Figure 1: Hierarchical document clustering using MFI.
Level 0: d2 d6 d3 d8 d1 d5 d7 d4 d14 d10 d12 d15 d9 d11 d13.
Level 1: cluster documents L11 to L16. Level 2: L21 to L24.]
Represent each new document $\{L_{ij}\}$ in the hierarchy
by the maximal frequent item set of terms as its center
(as in step 6). These maximal frequent item sets are
obtained by combining the features of the maximal
frequent item sets of terms that group the documents.
Each new document also carries the correspondingly
updated weights of the maximal frequent item set of
terms. Here $\{L_{ij}\}$ denotes the j-th document at
level $L_i$ of the hierarchy. In the figure,
$\{L_{12} = L_{21}\}$ means that the maximal frequent item
set of terms of the 2nd document of level $L_1$ does not
match the MFI set of any other document at the same
level $L_1$, so it is carried over unchanged to the next
level; the same holds for the document
$\{L_{13} = L_{22}\}$. The documents $\{L_{11}, L_{15}\}$ and
$\{L_{14}, L_{16}\}$ in the first level are combined using
MFI based hierarchical clustering and are represented
in the second level as $L_{23}$, $L_{24}$.

V. PRIVACY PRESERVING OF WEB
DOCUMENTS USING EQUIVALENCE
RELATION

Most internet web documents are publicly available for
providing services required by the user. Such documents
contain no confidential or sensitive data (they are
open to all). How, then, can we provide privacy for
such documents? Nowadays the same information exists in
more than one document in duplicate forms. The way to
provide privacy preservation for documents is to avoid
duplicate documents, thereby protecting the individual
copyrights of documents. Many duplicate document
detection techniques are available, such as syntactic,
URL based and semantic approaches. Each technique
incurs the processing overhead of maintaining shingles,
signatures or fingerprints [13, 14, 15, 18]. In this
paper, we propose a new technique for avoiding
duplicate documents using an equivalence relation. Let
Ds be the input set of duplicate documents, a subset of
the web document collection. First find the Jaccard
similarity measure for every pair of documents in Ds
using the weighted feature representation of maximal
frequent item sets discussed in step 2 and step 3 of
the algorithm. If the similarity measure of two
documents is equal to 1, then the two documents are
most similar. If the measure is 0, then they are not
duplicates. The Jaccard index, or Jaccard similarity
coefficient, is a statistical measure of similarity
between sample sets. For two sets, it is defined as the
cardinality of their intersection divided by the
cardinality of their union. Mathematically,

$J(d_1, d_2) = \frac{|d_1 \cap d_2|}{|d_1 \cup d_2|}$

For every pair of documents, calculate the Jaccard
measure. All the diagonal elements in the matrix are
ones, because every document is most related to itself.
When we classify the documents into equivalence
classes, we do not consider these ones and put zeros
instead. The Jaccard similarity coefficient matrix for
four documents can be represented as follows.

$R_\alpha$ =
         d1     d2     d3     d4
  d1     1      0.4    0.8    0.5
  d2     0.4    1      0.8    0.4
  d3     0.8    0.8    1      0.9
  d4     0.5    0.4    0.9    1

where $\alpha$ is the threshold. Let us define a relation
R on $D_s = \{d_1, d_2, d_3, d_4\}$ as the collection of
document pairs whose similarity measure is above some
threshold value, i.e.
$R = \{(d_i, d_j) \mid J(d_i, d_j) \ge \text{threshold}\}$.
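
The Jaccard measure and the pairwise similarity matrix can be sketched in Python as follows. As a simplification, documents are treated here as plain sets of terms rather than the weighted feature vectors used in the paper, and the helper names are illustrative.

```python
def jaccard(a, b):
    """|a intersect b| / |a union b| for two term sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(docs):
    """docs: {name: set of terms}. Returns (names, matrix of pairwise Jaccard values)."""
    names = sorted(docs)
    matrix = [[jaccard(docs[p], docs[q]) for q in names] for p in names]
    return names, matrix
```

Computing all pairs takes a number of Jaccard evaluations that grows quadratically with the number of documents, and the resulting matrix is symmetric with ones on the diagonal for non-empty documents.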



1. R is reflexive on Ds iff $R(d_i, d_i) = 1$, i.e. every
document is most related to itself.

2. R is symmetric on Ds iff $R(d_i, d_j) = R(d_j, d_i)$,
i.e. if the document $d_i$ is similar to $d_j$ then the
document $d_j$ is also similar to $d_i$.

3. R is transitive on Ds iff

$R(d_i, d_k) \ge \max_j \{ \min\{ R(d_i, d_j), R(d_j, d_k) \} \}$.

Then R is transitive by the definition.
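
The three conditions can be checked mechanically for any similarity matrix. The Python sketch below encodes them directly; the matrix `R` is the four-document Jaccard matrix given in this section, and the function names are illustrative.

```python
# Jaccard similarity matrix for d1..d4 as given in this section.
R = [
    [1.0, 0.4, 0.8, 0.5],
    [0.4, 1.0, 0.8, 0.4],
    [0.8, 0.8, 1.0, 0.9],
    [0.5, 0.4, 0.9, 1.0],
]


def is_reflexive(r):
    """Condition 1: every diagonal entry equals 1."""
    return all(r[i][i] == 1 for i in range(len(r)))


def is_symmetric(r):
    """Condition 2: r[i][j] equals r[j][i] for all pairs."""
    n = len(r)
    return all(r[i][j] == r[j][i] for i in range(n) for j in range(n))


def is_max_min_transitive(r):
    """Condition 3: r[i][k] >= max over j of min(r[i][j], r[j][k])."""
    n = len(r)
    return all(
        r[i][k] >= max(min(r[i][j], r[j][k]) for j in range(n))
        for i in range(n)
        for k in range(n)
    )
```

On this sample matrix the first two checks return True, while the max-min check returns False, because R(d1, d2) = 0.4 whereas min(R(d1, d3), R(d3, d2)) = 0.8. Condition 3 is therefore worth verifying for a given similarity matrix rather than assuming it.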

Then R is an equivalence relation on Ds, which
partitions the input document set Ds into a set of
equivalence classes. An equivalence relation seems a
natural technique for duplicate document
categorization. Any two documents in the same
equivalence class are related, and two documents are
different if they come from two different equivalence
classes. The union of all equivalence classes is the
document set Ds. Pairs of documents with high syntactic
similarity, excluding the diagonal elements, are
typically referred to as duplicates or near duplicates.
By using the equivalence relation, we can easily
identify the duplicate documents, or we can perform
clustering on the duplicate documents. Apart from the
representation of the document feature vector by MFI,
considering who the author of the document is, when the
document was created and where it is available also
helps in effectively finding the duplicate documents.
Each document in the input Ds must belong to a unique
equivalence class. If R is an equivalence relation on
$D_s = \{d_1, d_2, d_3, d_4, \ldots, d_n\}$, then the number
of ordered pairs in R always lies in the range
$n \le |R| \le n^2$, i.e. the time complexity of
calculating the equivalence relation on Ds is $O(n^2)$.
Choose the threshold $\alpha$ in the equivalence relation
as 0.8, i.e. $J(d_i, d_j) \ge 0.8$. Since the matrix is
symmetric, the document pairs
$\{(d_3, d_1), (d_3, d_2), (d_4, d_3)\}$ are the most
related. Hence these documents are near duplicates, and
grouping them into clusters provides privacy for the
individual copyrights of documents.
$R_{0.8}$ =
         d1   d2   d3   d4
  d1     0    0    1    0
  d2     0    0    1    0
  d3     1    1    0    1
  d4     0    0    1    0
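
The thresholding step, and one possible way to turn the result into groups of near duplicates, can be sketched in Python as below. The grouping step is an assumption on our part: a thresholded similarity relation is not transitive in general, so the sketch takes its transitive closure by merging linked documents with a small union-find structure.

```python
def threshold_relation(sim, alpha):
    """1 where similarity >= alpha; the diagonal is set to 0 as in the text."""
    n = len(sim)
    return [
        [1 if i != j and sim[i][j] >= alpha else 0 for j in range(n)]
        for i in range(n)
    ]


def group(rel):
    """Classes of the transitive closure of rel (assumed grouping strategy)."""
    n = len(rel)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if rel[i][j]:
                parent[find(i)] = find(j)  # merge the two classes
    classes = {}
    for i in range(n):
        classes.setdefault(find(i), []).append(i)
    return sorted(classes.values())
```

With the four-document Jaccard matrix and a threshold of 0.8, `threshold_relation` reproduces the 0/1 matrix shown above, and `group` places all four documents in a single class because d3 links the other three. Raising the threshold to 0.9 leaves only d3 and d4 grouped together.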



VI. CONCLUSION AND FUTURE SCOPE 

Cluster analysis can be used as a powerful, stand-alone
data mining technique that gains insight and knowledge
from huge unstructured databases. Most conventional
clustering methods do not satisfy the requirements of
document clustering, such as handling high
dimensionality and huge volumes and providing easy
access to meaningful cluster labels. In this paper, we
presented a novel approach, the Maximal Frequent Item
set (MFI) Based Similarity Measure for Hierarchical
Document Clustering, to address these issues.
Dimensionality reduction is achieved through MFI. By
using the same MFI similarity measure in hierarchical
document clustering, the number of levels is decreased,
which makes browsing easier. Clustering has its roots
in many areas, including data mining, statistics,
biology and machine learning; by applying MFI based
techniques in these areas we can obtain high quality
clusters. Moreover, by means of maximal frequent item
sets, we can predict the most influential objects of
clusters in the entire dataset for applications such as
business, marketing, the World Wide Web and social
network analysis.